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Abstract 

Background: Patterns of genome-wide methylation vary between tissue types. For example, cancer tissue shows 
markedly different patterns from those of normal tissue. In this paper we propose a beta-mixture model to 
describe genome-wide methylation patterns based on probe data from methylation microarrays. The model takes 
dependencies between neighbour probe pairs into account and assumes three broad categories of methylation, 
low, medium and high. The model is described by 37 parameters, which reduces the dimensionality of a typical 
methylation microarray significantly. We used methylation microarray data from 42 colon cancer samples to assess 
the model. 

Results: Based on data from colon cancer samples we show that our model captures genome-wide characteristics 
of methylation patterns. We estimate the parameters of the model and show that they vary between different 
tissue types. Further, for each methylation probe the posterior probability of a methylation state (low, medium or 
high) is calculated and the probability that the state is correctly predicted is assessed. We demonstrate that the 
model can be applied to classify cancer tissue types accurately and that the model provides accessible and easily 
interpretable data summaries. 

Conclusions: We have developed a beta-mixture model for methylation microarray data. The model substantially 
reduces the dimensionality of the data. It can be used for further analysis, such as sample classification or to detect 
changes in methylation status between different samples and tissues. 



Background 

Interest in understanding the effects of epigenetics in 
relation to different complex diseases is increasing. One 
epigenetic mechanism of particular interest is DNA 
methylation at cytosines in CpG dinucleotides. The 
methylation patterns of genes may change and these 
alterations have been shown to be related to complex 
diseases, such as heart diseases [1], schizophrenia [2] 
and different cancers [3,4]. In cancer, several methyla- 
tion changes are detectable at the early stages of cancer 
or even in pre-cancerous tissues or blood [5,6]. In addi- 
tion, other methylation alterations have been shown to 
be specific to cancer type and stage [7,8]. 

High-throughput technologies, such as microarrays 
and large-scale sequencing, allow genome-wide 
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methylation measurements. Analysis of methylation data 
requires efficient statistical methods to be able to iden- 
tify potential methylation biomarkers and differential 
methylation patterns across sample types. Several meth- 
ods have been proposed to pinpoint significant methyla- 
tion differences in patients with cancer and to classify 
different tissue types. Examples include feature selection 
methods [9], mixed effect and generalized least square 
methods [10] and singular value decomposition-based 
methods [11]. In addition, methylation patterns of dis- 
tinct microarray probes have also been modeled with 
beta-mixture models, which subsequently are used with 
partitioning algorithms to separate different tissue types 
into clusters [12]. Each probe is modelled with its own 
beta-distribution that depends on the tissue type only. 
In the present paper, we propose a beta-mixture model 
to describe genome-wide methylation. We assume that 
methylation levels of nearby probes are dependent, as 
demonstrated previously by [13] and [14], and that 
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methylation can be categorized into three different 
broad categories or states, low, medium and high 
methylation. In addition, we include knowledge about 
the genomic background (proximity to CpG-islands) of 
the probes following suggestions in [15]. In total, our 
model has 37 parameters (see Section Data Analysis) 
per sample compared with approximately 27k probes in 
Illumina methylation arrays. The parameters comprise 
genomic background, methylation state and level, and 
dependency between probes. In short, the model facili- 
tates 

(i) reduction of the dimensionality of a methylation 
profile 

(ii) for each probe, computation of the posterior 
probability of a methylation state, 

(iii) computation of a posterior probability that the 
latter state was correctly predicted. 

We apply the model to a set of methylation microar- 
ray measurements from colon cancer samples and show 
that the model parameters reflect global patterns in the 
data. Based on the estimated parameters, we are able to 
classify the samples with high accuracy and to exhibit 
global differences between the cancer samples. Further- 
more, the model assigns a methylation state to each 
probe value. Using the states, accessible data summaries 
are provided. 

Results and Discussion 

The Methylation Array and the Number of CpGs in Probes 
in Different Genomic Regions 

The Illumina human methylation 27k array consists of 
27,578 probes that measure the methylation status of 
CpGs in the human genome at single nucleotide resolu- 
tion. The array measures genome-wide methylation and 
the probes target over 14,000 genes. The great majority 
of the genes included in the array have two methylation 
probes (80.9%), while 17.6% of the genes have one 
methylation probe and only 1.5% have more than two 
probes. The degree of a methylation of each probe is 
measured by the beta-value which is a continuous vari- 
able varying between zero and one, where one means 
full methylation. The probes of the Illumina human 
methylation 27k array are 50 nucleotides long and con- 
tain different numbers of CpGs. Probe locations can be 
divided into CpG islands (I), CpG island shore regions 
(within 2000 bp from CpG-islands) (S) and outside 
regions (O) (these definitions are adapted from [15]). 
About one quarter of the probes (6,150) are located in 
the shore regions, one sixth (4,922) in outside regions 
and the rest (13,819) in the CpG-islands. Depending on 
the region, methylation appears to happen at different 
rates such that CpG-islands are usually less methylated 



than the CpGs in other genomic regions [15]. Logically, 
this results in an uneven distribution of methylation in 
regions with different numbers of CpGs because more 
CpGs occur in CpG-islands than in outside regions; see 
Figure 1. As a consequence, the number of CpGs affects 
the amount of methylation. In fact, we found a negative 
correlation between the number of CpGs in a probe and 
the beta-value of the probe (Pearson's correlation coeffi- 
cient r = -0.28). However, this correlation is reduced 
markedly when studying the probes in I, S or O regions 
separately (correlation coefficients r = -0.049, r = -0.053 
and r = 0.083, respectively). The probe distance from 
the CpG-island in the O class showed no correlation 
with the strength of methylation. 

Parameter Estimates and Interpretation 

We used the proposed method to analyze methylation 
microarray data from normal and colon cancer tumor 
samples. Colon cancer can be divided into different 
types and here we study patients with microsatellite 
instable (MSI) and microsatellite stable (MSS) tumors. 
In addition, we have samples of benign colon adenomas 
that are not considered as cancer tumors but are classi- 
fied as MSS-type adenomas. Our data set contains 42 
Illumina methylation 27k microarray samples, divided 
into 6 normal, 6 adenoma, 6 MSI and 24 MSS-samples. 
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The basic idea behind the model is the assumption 
that the methylation level of a CpG (probe) can be 
divided into three different states, low (L), medium (M) 
and high (H) methylation. The three states are biologi- 
cally motivated in the following sense. L corresponds to 
the situation, where (almost) all cells in a sample are 
unmethylated and H to the situation, where (almost) all 
cells are methylated, irrespectively of the composition of 
the cells in the sample. M captures the situation in 
which the cells are only partly methylated (e.g. hemi- 
methylation), or some cell types in the sample are 
methylated while others are not. 

The latter could be the case if the sample consists of 
different cell types (sample purity) or one cell type 
shows heterogeneity, as would be expected for cancer 
cells. The three states are further empirically motivated 
(Figure 2). 

We assume that statistical properties of these different 
states are the same throughout the genome and that 
methylation of a CpG site depends on its location (I, S 
or O) in the genome in relation to the nearest CpG- 
island. We concentrate on modelling genes with two 
probes; however, our model can also be extended to 
genes with more than two methylation probes (for the 
genes with more than two probes the two first probes 
were used). The beta-values of a gene's probe pairs are 
dependent with a correlation coefficient of r = 0.668 
between the probes. A similar degree of correlation has 
been reported for other data sets, while the probes mea- 
suring methylation levels of different genes have shown 
no dependency [16]. We built a model that can be con- 
sidered a hidden Markov model (HMM) of probe pairs 
within a gene. The hidden states are the methylation 
levels, L, M and H. Further, the CpG probe pairs are 
classified by their locations into classes (1,1), (S,S), (0,0) 
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and (I,S). The latter includes both (I,S) and (S,I) pairs. 
Other cases were omitted as they contained only a few 
or no probe pairs. We built one model for each of the 
classes. 

We fitted a mixture of three beta-distributions (distri- 
butions corresponding to the low, medium and high 
methylation states, respectively) for each sample in the 
colon cancer data set, such that the beta-distribution cor- 
responding to the medium methylation state is symmetri- 
cal; as described in greater detail in Methods. The beta- 
distribution gives the density of the beta-value given the 
hidden state. That is, for each sample we estimate beta- 
distribution parameters a and [}, mixture proportions co 
and a transition probability matrix T for the HMM (see 
Methods for further details). The mixture proportions 
are the a priori probabilities that a probe is found in a 
given hidden state and the transition matrix gives the 
probabilities that a probe in some hidden state k x = L, M, 
or H, is followed by a probe in state k2 = L, M, or H. For 
the class (1,1) we set the mixture coefficient of the high 
methylation state to 0, i.e. this state could not be reached 
as high methylation appears to be very rare in this class. 
The same is assumed for the I probes in the class (I,S). 
Figure 2 shows the empirical and the fitted mixture dis- 
tributions for one MSI-sample for different classes. We 
also built a model with only two states for every class and 
a model where the high methylation state was included in 
all classes but these did not reflect the data properties 
equally well as the model presented. 

The mixture model parameters in different classes var- 
ied between samples, but some general trends in the 
sample groups (normal, adenoma, MSI, MSS) could be 
seen; see Figure 3 for examples. In the figure, the hierar- 
chial clustering of the samples based on the methylation 
array data (all probes) is also shown. The adenomas and 
the MSS cancers are mixed in two big clusters and the 
differences between these clusters can be seen in the 
parameters as well, cf. Figure 3 and 4. 

Regarding the mixture proportions, the group of MSI 
cancers is most easily discriminated from the other sam- 
ples (some examples are shown in Figure 3). Indeed, for 
the (1,1) probe pair class, the MSI mixture proportions 
for the low methylation state are clearly lower and the 
mixture proportions of the medium methylation state 
are higher than in the other classes. Similarly, but less 
clearly, differences for the shore (S) region probe pairs 
could be detected, e.g. medium and high methylation 
state mixture proportions are higher for MSI cancers 
compared with the other samples (results not shown). 
Furthermore, many mixture proportions vary in the 
same range when studying normals, adenomas and MSS 
cancers. However, normals have slightly higher propor- 
tions for low methylation in (1,1) class than the other 
samples. 



Laurila ef al. BMC Bioinformatics 201 1, 12:215 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/215 



Page 4 of 9 



o 

CM 



00 

d 



CD 

d 



d 



CM 

d 



o 
d 



_ ' r 

CO _ 

2 eg Jo 



CO C0C0C0C0 <C C0C0C0<W ^.^""O^^SntOrnmO) 

2 CO CO CO CO CO CO CO CO 5 COCOCO<<COCOCOCO^J"COCOb" 



cocoSm<< otc/3 

55^^ COCO 



CO CO 
CO CO 



z z z z z z 



* o x 9 




A A 
A A A 

+ 

+ 

+ 



o o 



A 

+ 



o o 



o o 
o o o 



X x x 









X 


X 


x x x 


X 




X 


A 






A 








A 


A A A 






A 








A 




+ 






+ 


+ 


+ 


+ A 


+ 



o o 



A A 
+ 



o o o o 



O n O o 



o o 



X x 



A A a 
A A A A A A A A 



+ + + ++ + 



+ + 



o o o 



A A A A A A 



+ + + 



Figure 3 Hierarchial clustering and mixture proportions. Top part: Hierarchial clustering of samples using methylation array data (all probes). 
Lower part: Four mixture proportions U>aia 2 ki. from different classes plotted in the same order as in the cluster dendrogram. N = normal, A = 
adenoma. L = low methylation state, M = medium methylation state. 



The probes in the outside regions show the biggest 
variations between the groups. Normals have small coef- 
ficients for low methylation whereas the high and med- 
ium methylation states are almost equally probable. On 
the contrary, for all tumor samples medium methylation 
is clearly the most evident and the proportions of the 
low and high methylation vary greatly. In addition, for 
some adenoma and MSS cancer samples, low methyla- 
tion was more probable than high methylation while in 
MSI cancer samples the low methylation state always 
had the smallest coefficient. Mixture proportions in the 
outside regions also distinguished the two clusters of 
adenomas and MSS cancers well. Overall, MSS cancers 
and adenomas share similar proportions in all the 
classes. Class (I,S) mixture proportions do not show as 
large differences between groups as other classes. 

We can see similar differences between groups in the 
estimated parameters of the beta-distribution as in the 
mixture proportions. We performed a principal com- 
ponent analysis (PCA) on the vector of beta-distribu- 
tion parameters (there are 13 for each sample) to 
illustrate how the parameters vary across samples and 
tissues. In Figure 4 all samples are plotted based on 
the three first principal components. Again, normals 
and MSI cancers are clustered into distinct groups 
while adenomas and MSS cancers overlap. However, 



the two clusters of the hierarchial clustering (Figure 3) 
encompassing adenomas and MSS cancers could be 
distinguished by the principal components. To further 
illustrate the use of the model we classified the sam- 
ples using the mixture proportions and a leave-one-out 
procedure (Equation 1 in Section Data Sets). We used 
as classes the four biggest clusters that can be seen in 
Figure 3 (clusters 1-4). These were obtained by hierar- 
chial clustering. The fifth cluster was omitted as it 
contained only two samples i.e., altogether 40 samples 
of the clusters 1-4 were used for the classification ana- 
lysis. Overall, 32 out of the remaining 40 samples 
(80%) were classified correctly, see Table 1. For com- 
parison, we followed the same procedure using beta- 
values from 100, 500, 1000, 2000, and 5000 probes and 
found similar classification results though with lower 
performance (see Additional file 1 Tables S6-S10). 
Using k-means only results generally in better perfor- 
mance: 82.5% using mixture proportions and 90% 
when using 5000 probes; for fewer probes the perfor- 
mance was less than when using the mixture propor- 
tions (see Additional file 1 Tables S1-S6 and Table 
Sll). Note that k-means find the best division of the 
samples into four clusters, whereas the leave-one-out 
method assumes clusters are defined and classifies the 
samples one at a time. 
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Figure 4 Principal component analysis on parameters of beta distributions Samples are plotted according to the three first principal 
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Data Summaries 

In this section we further illustrate the use of the model. 
For each probe pair, we computed the posterior mixture 
proportions and the most likely state. The most frequent 
state in a group was defined as the group's overall state. 
In Table 2, a summary of the results is shown. There 
are clear differences between groups which we expect 
from the previous analysis. Normal samples have almost 
twice as many HH probe pairs as the other groups. 
Likewise, MSI cancers have fewer probe pairs with low 
methylation states, while the medium methylation state 



Table 1 Sample Classification 



True cluster 


Total 




Predicted as 








1 


2 3 


4 


1 


6 


5 


1 0 


0 


2 


16 


1 


14 1 


0 


3 


12 


0 


4 8 


0 


4 


6 


0 


0 1 


5 



Classification based on leave-one-out cross validation. 



is far more common than in the other groups. The dis- 
tributions of methylation states for adenomas and MSS 
cancers resemble each other. 

We use the posterior mixture proportions to calcu- 
late the false annotation rate, FAR (see Section Data 
Analysis). FAR measures how often a probe pair is 
assigned to the wrong state: To each probe pair, we 



Table 2 Numbers of probe pairs 
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assign the hidden state (ki, k 2 ), with k\, k 2 = L, M, or 
H, with the highest posterior proportion. The probabil- 
ity that this assignment is incorrect is given by FAR 
(see Section Data Analysis). For the colon data set, 
FAR = 0.128, implying that about 1 in 8 probe pairs 
should have a wrong annotation. For example, in Fig- 
ure 2, the bottom left plot, wrong annotations are 
likely to occur when the probability of the low and the 
medium states are similar (e.g. beta-value around 0.1), 
while confident annotations are made when e.g. the 
beta-value is around 0.5. 

Next, we selected the probe pairs where the overall 
group state differed between groups. We further 
reduced the number of probe pairs by only choosing 
those for which the posterior mixture proportions 
showed a significant difference between groups using 
Fischer's linear discriminant analysis. Table 3 shows 
the results. For each probe pair, one or both probes 
might differ between two groups; in the table we 
count how many probes show a given change, e.g. 
L— >M. Most differences were found between normals 
and MSI cancers with almost 2000 differences while 
only 35 changed probes are detected between ade- 
noma and MSS samples. This again shows that based 
on methylation data, adenomas and MSS cancers are 
difficult to distinguish. In addition to the number of 
changes, also the type of changes differed between 
comparisons. 

For example, when comparing normals with MSI can- 
cers, over 80% of changes were from low to medium 
methylation. In comparison, between normals and ade- 
nomas and normals and MSS cancers, the proportions 
were 42% and 37%, respectively. 

The characteristics of the changed probes also differed 
between comparisons; see Table 4. For all comparisons 
except between normals and MSS cancers the majority 
of methylation changes happened at CpG-islands. 
Further, there are differences between adenomas and 
MSS cancers which may be used to distinguish between 
the two tissue types. 



Conclusions 

In this paper, we have proposed a model for microarray 
methylation data. The model uses four different probe 
pair classes and three different methylation states. It is 
motivated by the empirical distribution of beta-values 
and knowledge of the genomic content of CpG dinu- 
cleotides. It reduces the dimensionality of a microarray 
data set to 37, the number of parameters in the model. 
The model allows us to assign one of three broad 
classes (low, medium or high) to each methylation 
probe value and assess the correctness of the 
assignment. 

Further, we illustrate the use of the model by analys- 
ing a colon cancer data set. Normal and MSI samples 
could easily be distinguished from the other samples, 
but adenomas and MSS cancers were mixed together. 
However, the hierarchial clustering based on all beta- 
values (27k probes) also mixed these two groups. This 
suggests that the methylation patterns in adenomas and 
MSS cancers are very similar, which is in agreement 
with previous studies [17,18]. In addition, we identified 
differences in the genomic localisation of methylation 
changes. This observation may be used to discriminate 
between adenomas and MSS cancers from genome-wide 
methylation data. 

In the future it would be interesting to integrate 
information from different data sources, such as 
methylation, gene expression and copy numbers, into 
one model. It may also be beneficial to take the full 
step and model at the level of the DNA sequence 
directly, anticipating the rapidly growing interest in 
next-generation sequencing. 

Methods 

Data Sets and Preprocessing 

We used a data set that consists of 42 Illumina methyla- 
tion 27k microarray samples, that can be divided into 6 
normal, 6 adenoma, 6 MSI and 24 MSS-samples. Raw 
data was preprocessed and normalized with the back- 
ground method using Illumina's BeadStudio software. 



Table 3 Methylation state changes 





Norm vs Aden 


Norm vs MSI 


Norm vs MSS 


Aden vs MSI 


Aden vs MSS 


MSS vs MSI 


L -> M 


256 


1627 


613 


673 


25 


42 


M -> L 


173 


87 


344 


7 


8 


974 


M H 


9 


24 


24 


16 


2 


152 


H — > M 


168 


238 


694 


4 


0 


131 


L H 


0 


0 


0 


0 


0 


0 


H -> L 


1 


0 


0 


0 


0 


0 


Total 


606 


1977 


1675 


700 


35 


1299 



Number of changes in methylation states in group comparisons. Norm - normal, Aden = adenoma. 
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Table 4 Number of changes in different regions in group comparisons 
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After this the methylation value for each probe (the 
beta-value) was computed with the formula 

max(A4, 0) 



max(M, 0) + max( U, 0) + 100' 

where M is the value of the methylated bead type 
probe and U is the value of the unmethylated bead type 
probe. Beta-values vary between zero and one. In the 
analysis, we omit probes with [} = 0, these are generally 
of bad quality. 

Model 

Let Sj be the location (I, S or O ) of the /th probe in the 
genome and let x ; y be the beta-value computed for the 
probe / of sample i. Based on the locations, the data is 
divided into classes C a = {j\sj = a}, where a e {I, S, O}. 
For each class, we assume that the data follows a beta 
mixture model: 



f{Xij\j g C a ) 

E 



,<*<*— 1 



(1 



Here, k = L, M, H can be considered hidden 
(unknown) states and m akh the a priori probability that 
a probe from sample i and class a is in state k. The 
co aki 's are called mixture proportions for sample i and 
class a, and fulfill T, k a> aki = 1. If a probe is in hidden 
state k, it emits a methylation value according to a beta- 
distribution with parameter {a ak b j5 a ki) and normalizing 
constant B(a aki , /3 aki ). We assume that the beta-distribu- 
tion of the medium methylation state is symmetric, i.e. 
a flMi = PaMi- Throughout the paper k refers to the 
methylation state, k = L, M, H. 

To take dependencies between neighboring probe into 
account we do the following (for an illustration, see Fig- 
ure 5). For a probe pair (xy, ^y+i)), i.e. the first two 
probes in a gene, we model the two probes assuming 
Markov dependency between them, 



*i0+i)(j G C ai ,j + 1 € C ai ) = 

E l 0,^-1 

fe l£ {L,M,H) fllfel! 

fe 2 e{L,M,H) a °i h 2< 

^r i (i-^ + i ) )^- i ]= 

fei,fc 2 e(L,M,H) 

1 Cnj'^'" 1 




/urst probe in class C ai 



Second probe in class C a , 



Figure 5 Graphical view of the model The first probe of a probe 
pair belonging to class (a,, a 2 ) reaches the states L, M and H with 
probabilities <U fll(22 L, 0) aia2 M and CO aia2 H, respectively, and each 
state emits a methylation value from a corresponding beta- 
distribution (denoted in the figure by Beta ai fe). Sample index /' is 
suppressed for clarity. Then, transition to the state of the second 
probe happens according to the transition probabilities t dl a 2 k 1 k 2 
and similarly to the first probe, a methylation value is emitted from 
a beta-distribution. 
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Here, (a\, a 2 ) denotes the class of the probe pair, B ai k,i 
is short for the normalizing constant B(a ai k,i, Pa,k,i) of 
the corresponding beta-distribution, and 
v ai a 2 hM = Mamik-iiTa^hfai is the a priori probability 
(mixture proportion) that the probe pair has (hidden) 
methylation state (/<i, k 2 ). There are nine such hidden 
states. 

The bottom line in Equation 1 gives the density of the 
paired methylation values (xu, *,(/ + i)) as a mixture distri- 
bution over the nine hidden states. Given the probe pair 
is in hidden state (/q, k 2 ), the methylation values are 
emitted independently according to two beta distribu- 
tions. The parameters of the beta-distribution are 
assumed to depend only on the corresponding probe, 
not the probe pair, and again we assume symmetry for 
the medium methylation state, i.e. a a ,Mi = Pa,Mi> / = 1> 2. 
The middle line shows the same density written as a 
Hidden Markov Model. The first probe is in state 
with probability ;, while the second probe is in 

state k 2 with probability T ai a 2 } h j l2 i (given the first is in 
ki). Thus, Ta^l^Mii is a 3 x 3 transition matrix for each 
class («!, a 2 ) and sample i. 

If Equation 1 is marginalized to obtain the density for 
a single probe, we find Equation 1 with 
<w«2fci = Y^k 2 Va^kMiiKaav where ^a ia2 is the empirical fre- 
quency of class («!, a 2 ) among the probe pairs. Similarly 
for the second probe in the pair. 

Data Analysis 

Model parameters are estimated using maximum likeli- 
hood. Briefly, first the beta-distribution parameters are 
defined for each probe class (Cj, Cs, Co) and for each 
state (L,M,H) using R's optim-function and EM-algo- 
rithm, the parameters are obtained according to Equa- 
tion 1. Secondly, for each probe pair, the obtained beta- 
distribution parameters are used to estimate the mixture 
parameters ^a^kA and transition probabilities of the 
matrix T . In this step, the Baum-Welch algorithm is 
used. After the model estimation, parameters are 
obtained that include 13 beta-distribution paratemers 
(medium methylation distribution is symmetric, i.e., only 
one parameter is needed) for L, M and H states, 7 mix- 
ture proportion parameters for probe classes (1,1), (S,I), 
(S,S) and (0,0) (one for the class (1,1) as high methyla- 
tion cannot be obtained and two for the other classes) 
and 17 transition probabilities for the four probe pair 
classes (6 parameters for (S,S) and (0,0), 3 for (S,I) and 
2 for (1,1)). 

The most likely methylation states for each probe pair 
are computed with the Viterbi algorithm, in addition, 
we compute posterior probabilities for each possible 
state combination for each probe pair. We exclude the 
classes (S,0), (0,S), (1,0) and (0,1) from the analysis 



because there are very few probe pairs in these classes 
(121 in total). For convenience, we group (S,I) and (I,S) 
together by swapping the probes of the latter. In this 
way we are left with four classes, (1,1), (S,S), (0,0), and 
(S,D- 

For the classification of the samples using estimated 
parameters, we apply a leave-one-out method [19]. If 
there are G different groups, then sample i is classified 
as belonging to the group that minimizes the Mahanalo- 
bis distance 

where v f ,s, t = {a.\, a 2 , k\, k 2 ), are the mixture propor- 
tions of the different two probe classes, fi tg and er^ are 
the empirical mean and variance of the v t ,s over all sam- 
ples i in group g. If sample i belongs to group g, then it 
is left out when calculating the mean and the variance 
of that particular group. In addition, we performed the 
same procedure using beta-values from 100, 5000, 1000, 
2000, and 5000 probes. Also we used k-means with the 
same number of beta-values. These were selected as 
those having the largest variance among all the probes. 
Note that the first approach assumes we know the 
groups, while in the second approach k-means finds the 
optimal division of samples into four groups. 

We compute a false annotation rate (FAR) for the data 
which we define similarly to the false discovery and the 
false negative rates in [20]; that is, for a set / of probe 
pairs (with cardinality #/) the FAR is defined by 

¥AR=J2(1-Y{xj,x j+1 ))/#J, (2) 
where 

y[xj,Xj +1 )= max P(kik 2 \Xj,x j+ i) ( 3 ) 

fei,fe 2 e{L,M,H| 

is the posterior probability of the most probable state of 
probe pair (/', + 1) with methylation values {xj, Xj +1 ) and 

P{kik 1 \x j ,x j+ i), ki,k 2 <s {L,M,H}, (4) 

are the posterior mixture proportions given (xp Xj + i), 
calculated with the Viterbi algorithm. The FAR is a nat- 
ural measure here as it provides the posterior probability 
(i.e. given the methylation values) that a probe pair is 
classified as being in hidden state (k\, k 2 ), when in fact 
it is in (fe' 1; k' 2 ). 

We used Fischer's linear discriminant analysis [21] to 
test for differences in posterior mixture proportions 
between groups. To assess the significance of the test 
statistics we permuted group labels 10 000 times and 
redid the analysis. We used a significance level of 1%. 



Laurila ef al. BMC Bioinformatics 201 1, 12:215 Page 9 of 9 

http://www.biomedcentral.eom/1 471 -2 1 05/1 2/21 5 



Programs used for statistical analysis were written in R 
http://www.r-project.org/ and are available upon request. 

Additional material 



io. 



Additional file 1: Tables SI -SI 1 . Classification results using k-means 
clustering and the leave-one-out method with Mahalanobis distance. In 
Tables SI-SI 0, 100-5000 probes with the highest variance across samples 
were used in the analysis. Table S1 1 shows the results of k-means 
classification using the mixture proportions only (as in Table 1). Clusters 
are the same as in Figure 3 and 3 in Table 1. 
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